I will analyze the Red Wine Dataset. The key goals of the study are to understand which chemical properties influence the quality of red wines and how those properties correlate with one another.
About the data: The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Number of Attributes: 11 + output attribute
Attribute information:
Input variables (based on physicochemical tests):
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Wine quality is a discrete, ordinal variable that ranges from 3 to 8 in the given dataset. There are no exceptionally good or exceptionally bad wines. Treating the data as continuous gives a mean of 5.636 and a median of 6.
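As a minimal sketch of the two views of quality described above, the snippet below stores the ratings both as an ordered factor and as plain numbers. It uses only the six quality values from the head() output, so its mean will not match the full-data figure of 5.636. The names `quality_raw`, `quality_factor` and `quality_counts` are illustrative assumptions, not the code behind this report.

```r
# Quality ratings of the first six wines, copied from the head() output above.
quality_raw <- c(5, 5, 5, 6, 5, 5)

# Categorical view: an ordered factor keeps the ranking of the grades.
quality_factor <- factor(quality_raw, ordered = TRUE)
quality_counts <- table(quality_factor)
print(quality_counts)

# Continuous view: treat the ratings as plain numbers.
quality_mean <- mean(quality_raw)
quality_median <- median(quality_raw)
```

The factor form is what a boxplot or bar chart by grade would use, while the numeric form is what a mean or a correlation needs.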
Fixed acidity appears to be positively skewed, with a mean of 8.32 above its median of 7.90.
The mean and median of pH are approximately equal, at 3.311 and 3.310 respectively, which suggests that pH is roughly symmetric and is consistent with an approximately normal distribution. Also, a little research online showed that red wines typically have a pH between 3.3 and 3.6.
We plotted histograms of the 11 chemical properties of red wine to get an idea of the dispersion of each property. Based on the histograms plotted above, the following observations can be made on the distribution of the chemical properties:
Large outliers can be seen in the positively skewed, long-tailed variables. We will transform some of them by taking log10, which produces a relatively normal distribution.
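The effect of the log10 transform can be sketched on a handful of numbers. The example below is only an illustration under stated assumptions: it uses the five quantile values of residual.sugar from the summary() output as a tiny stand-in for the full column, and `skewness` is a hypothetical helper defined here, not a base R function.

```r
# Min, 1st Qu., Median, 3rd Qu. and Max of residual.sugar from the summary()
# output above, used as a tiny stand-in for the full 1,599-value column.
residual_sugar <- c(0.9, 1.9, 2.2, 2.6, 15.5)

# Moment-based sample skewness (hypothetical helper); it is 0 for a
# perfectly symmetric sample and positive for a long right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# The log10 scale pulls in the long right tail, so skew_log is smaller.
print(c(raw = skew_raw, log10 = skew_log))
```

On the real data the same idea would apply to the whole column, for example by plotting a histogram of `log10(residual.sugar)`.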
There are 1,599 observations with 11 attributes (11 variables on the chemical properties of the wine) + 1 output attribute (quality of red wine).
The quality rating is the main feature in the dataset which defines the good and bad taste of the red wine.
Based on the above distributions, I think that fixed acidity, citric acid, residual sugar, pH, chlorides will be the features of interest.
No.
Some of the distributions were positively skewed and long-tailed, so I transformed them to produce a relatively normal distribution.
Let’s begin by examining the correlation between pairs of variables using a correlation plot.
## Var1 Var2 Freq
## 52 pH citric.acid -0.542
## 130 citric.acid pH -0.542
## 32 citric.acid volatile.acidity -0.552
## 45 volatile.acidity citric.acid -0.552
## 23 density fixed.acidity 0.668
## 92 total.sulfur.dioxide free.sulfur.dioxide 0.668
## 105 free.sulfur.dioxide total.sulfur.dioxide 0.668
## 114 fixed.acidity density 0.668
## 18 citric.acid fixed.acidity 0.672
## 44 fixed.acidity citric.acid 0.672
## 24 pH fixed.acidity -0.683
## 128 fixed.acidity pH -0.683
## 1 X X 1.000
## 16 fixed.acidity fixed.acidity 1.000
## 31 volatile.acidity volatile.acidity 1.000
## 46 citric.acid citric.acid 1.000
## 61 residual.sugar residual.sugar 1.000
## 76 chlorides chlorides 1.000
## 91 free.sulfur.dioxide free.sulfur.dioxide 1.000
## 106 total.sulfur.dioxide total.sulfur.dioxide 1.000
## 121 density density 1.000
## 136 pH pH 1.000
## 151 sulphates sulphates 1.000
## 166 alcohol alcohol 1.000
## 181 quality quality 1.000
## 182 numquality quality 1.000
## 195 quality numquality 1.000
## 196 numquality numquality 1.000
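Four of the most strongly correlated pairs of chemical properties are:
1. Fixed acidity & pH with a correlation coefficient of -0.683, indicating that pH tends to decrease as fixed acidity increases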
## [1] -0.6829782
2. Citric acid & volatile acidity with a correlation coefficient of -0.552, indicating that citric acid tends to decrease as volatile acidity increases
## [1] -0.5524957
3. Citric acid & pH with a correlation coefficient of -0.542 (slightly weaker), indicating that pH tends to decrease as citric acid increases
## [1] -0.5419041
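4. Citric acid & fixed acidity with a correlation coefficient of 0.672, indicating that citric acid tends to increase as fixed acidity increases
## [1] 0.6717034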
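The positive correlation between citric acid and fixed acidity is consistent with citric acid being one of the fixed (non-volatile) acids that contribute to fixed acidity.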
Let us now observe the boxplots of the selected variables by wine quality; comparing the medians gives a robust view of how each variable shifts across quality grades.
Higher quality wines tend to have a higher alcohol content than lower quality wines.
Volatile acidity decreases as the wine grade increases. Volatile acidity is responsible for the smell in wine and too much of it will reduce the wine quality.
Citric acid content is clearly associated with wine quality. In low grade red wines its median is close to 0, while higher grade wines have a higher, well balanced citric acid content.
Sulphates are used to maintain the freshness of wines, and the higher the level of sulphates, the higher the wine grade tends to be.
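The per-grade medians that the boxplots display can also be computed directly. The sketch below is a hedged illustration: it uses only the six rows from the head() output (so only grades 5 and 6 appear), and `wine_head` and `medians` are names assumed for this example.

```r
# Alcohol, volatile acidity and quality for the first six wines,
# copied from the head() output above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Median of each variable within each quality grade; these are the
# centre lines a boxplot by quality would draw.
medians <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = wine_head,
  FUN = median
)
print(medians)

# The matching plot would be, for example:
# boxplot(alcohol ~ quality, data = wine_head)
```

With the full dataset the same call would return one row per grade from 3 to 8, which makes the trends across grades easy to read off as numbers.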
Volatile acidity is responsible for the aroma of wine and is not intentionally included in the wine. It can be observed from the boxplot of volatile acidity and wine grade that the higher the volatile acidity, the lower the quality of the wine, and vice versa. Also, higher quality wines tend to have a high level of alcohol. The median for sulphates increases with each wine grade (quality).
When citric acid increases, fixed acidity also increases, denoting a positive correlation. Citric acid and volatile acidity are negatively correlated. Citric acid and pH are also negatively correlated, since a lower pH indicates a higher acidity.
pH & Fixed Acidity with the correlation coefficient of -0.683.
It can be observed that higher quality wine has lower volatile acidity.
It can be observed that higher quality wines have higher alcohol, lower volatile acidity and higher sulphates.
The multivariate analysis only strengthens the relationships we observed in the bivariate analysis. It shows that higher quality wines have higher alcohol, lower volatile acidity and higher sulphates.
No
Plot one shows the distribution of wine quality ratings. It can be observed that the given dataset of red wine contains a large number of wines that are average in quality. The mean and median of the quality of red wines are 5.636 and 6 respectively.
Based on the correlation, the following 4 chemical properties have the highest correlation with quality: alcohol, volatile acidity, citric acid and sulphates. The higher the wine grade, the higher the level of alcohol and citric acid. If we group wine grades as bad (3,4), average (5,6) and good (7,8), we can observe that average wines have a higher content of sulphates and alcohol. Also, the level of sulphates increases slightly in good grade wines, which plays an important role in maintaining the freshness of the wine.
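The bad / average / good grouping described above can be sketched with `cut()`. This is one possible way to do the binning, not necessarily the code used for the plot; the names `quality`, `grade` and `grade_lookup` are assumptions for this example.

```r
# Every quality score that occurs in the dataset (3 to 8).
quality <- 3:8

# cut() uses right-closed intervals by default, so the breaks below give
# (2,4] -> bad, (4,6] -> average, (6,8] -> good.
grade <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "average", "good")
)

grade_lookup <- data.frame(quality, grade)
print(grade_lookup)
```

Applied to the quality column of the full data frame, the resulting factor could serve as the grouping variable for the boxplots.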
It can be observed that higher quality wines have more alcohol and less volatile acidity; that is, wine quality tends to increase as alcohol increases and volatile acidity decreases.
The key goals of this study were to understand which chemical properties influence the quality of red wines and how those properties correlate with one another. The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. Initially, when I plotted the histograms of all 11 variables, I assumed from the shapes of the plots that some of these variables were related to each other, for example by being directly or inversely proportional or by one being a subset of another, which turned out to be true in the correlation analysis. The correlation showed that pH tends to decrease as fixed acidity increases, and that citric acid and fixed acidity go hand in hand, i.e. they are positively correlated with a value of 0.6717. After doing some web research, I learned a few things about the presence of different chemical properties in wine.
Volatile acidity has a negative correlation with quality. It refers to the acidic elements of the wine that are gaseous rather than liquid. It is mainly the acetic acid compound, which is largely responsible for the aroma. Though it is not intentionally included in the wine, it is an important characteristic in many wines that adds complexity and interest, often in a positive manner.
The presence of alcohol plays an important role in determining the quality of wines. Wines with a higher level of alcohol provide rich, ripe fruit flavors. Those flavors come from really ripe grapes, and really ripe grapes come from warmer growing conditions. Those grapes contain more sugar, and more sugar produces more alcohol during fermentation.
The presence of sulphates in wine helps determine its freshness and, based on the correlation, the level of sulphates increases with wine quality.
Further improvements could be made if data for exceptionally good and bad wines were available. However, assessing the quality of wine is complex, so if factors beyond the chemical properties were provided, such as storage duration and the quality and types of grapes, the quality of the analysis could be improved.
When I plotted the correlation matrix, all the data overlapped and it looked messy. A Google search showed how to present the data as an ordered list, so I created a correlation matrix, transformed it into a data frame and kept only the entries above a certain value, in order, to show only relevant values.
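The approach described above (correlation matrix, then data frame, then filter and order) can be sketched as follows. This is a minimal illustration, not the exact code behind the table in this report: it uses four columns of the six head() rows, and the 0.5 cut-off and the names `wine_head`, `cor_long`, `threshold` and `strong` are assumptions.

```r
# Four columns of the first six wines, copied from the head() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pairwise Pearson correlations as a square matrix.
cor_matrix <- cor(wine_head)

# Flatten the matrix into one row per pair; as.data.frame() on a table
# names the columns Var1, Var2 and Freq.
cor_long <- as.data.frame(as.table(cor_matrix))
cor_long$Freq <- round(cor_long$Freq, 3)

# Keep only the stronger correlations (assumed cut-off) and order them
# by absolute value so the strongest pairs come last.
threshold <- 0.5
strong <- subset(cor_long, abs(Freq) > threshold)
strong <- strong[order(abs(strong$Freq)), ]
print(strong)
```

Each pair appears twice (once in each direction) and every variable is listed against itself with a value of 1, which matches the shape of the ordered list shown earlier.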